Introduction

Due to the small nature of our dataset, we will build a model using all available data first, while attempting to be careful to avoid overfitting, and then formally evaluate it using cross validation at the end of our analysis.

Variable Selection

Our data comes from four different datasets. We used three of Riguang Wen's datasets from figshare.com -- , , and . We also used a dataset called uploaded to zenodo.org by Pablo Gomez and Sandra Giral. From these datasets, we considered the following variables.

Variable Description Type Source
Player Name of player Character
Age Age of player Numeric
G Games played Numeric
GS Games started Numeric
MP Minutes played Numeric
PER Player efficiency rating Numeric
PTS Points Numeric
X3PAr 3PA/FGA Numeric
FTr FTA/FGA Numeric
TS True shooting percentage Numeric
ORB Offensive rebounds Numeric
DRB Defensive rebounds Numeric
TRB Total rebounds Numeric
AST Assists Numeric
STL Steals Numeric
BLK Blocks Numeric
TOV Turnovers Numeric
USG Usage percentage Numeric
ORtg Offensive rating Numeric
DRtg Defensive rating Numeric
OWS Offensive win shares Numeric
DWS Defensive win shares Numeric
WS Win shares Numeric
WS.48 Win shares per 48 minutes Numeric
OBPM Offensive box +/- Numeric
DBPM Defensive box +/- Numeric
BPM Box +/- Numeric
VORP Value over replacement player Numeric
Pos Position Factor
Ht Height in inches Numeric
Wt Weight in pounds Numeric
PwrSix Power Six College? Indicator
International International Player? Indicator
Salary Salary in dollars Numeric

Immediately we can recognize that some variables are functions of others and therefore do not need to be considered. Specifically, , so there is no need to include in our model. Similarly, and , so we can exclude and from consideration if we include , , and in our model.

data_init <- read.csv("data/PlayerData.csv")[,-1] %>% mutate(logsal = log(SALARY))
data <- data_init %>% dplyr::select(logsal, Age, G, gs_adj, MP, PER, X3PAr, ftr_adj, orb_adj, DRB., AST., stl_adj, blk_adj, TOV., USG., ORtg, DRtg, ows_adj, dws_adj, WS.48, OBPM, DBPM, vorp_adj, Pos_cat, Height, pts_adj, TS., International, Conference)
attach(data)

From our EDA, we concluded that a log transformation of the response variable, salary, would be appropriate based on the skewed distribution.

We also noticed that many of the covariates had skewed distributions, or did not have linear marginal relationships with the response log(salary). We attempted to make these variables approximately normally distirbutied, with an approximately linear marginal relationship with the response. We experimented with log, square root, and squaring transformations, and ended up considering the following transformations in our model:

We can also see that Position has some redundancy and maybe too much granularity; for example "F-C" (forward/center) and "C-F" (center/forward) are treated as different positions by the model when functionally they are the same. We cleaned this up by grouping into "Guards", "Wings", and "Bigs". From the graph it is unclear if there is a significant relationship between the refined position predictor and log(salary).

pos_box = data_init %>% ggplot(aes(x = Pos, y = log(SALARY))) + geom_boxplot() + ggtitle("log(salary) by position (unrefined)")
pos_cat_box = data_init %>% ggplot(aes(x = Pos_cat, y = log(SALARY))) + geom_boxplot() + ggtitle("log(salary) by position (refined)")
grid.arrange(pos_box, pos_cat_box, nrow = 1)

Below we used VORP as an example; we can see the distribution of VORP and its plot against log(salary) pre and post transformation:

It's not perfect, but the transformed variable appears to be much more appropriate for a linear model than the raw variable. Once our predictors are appropriately transformed, we consider all pairwise correlations between covariates as an initial search for possibly collinearity.

corrplot(cor(data[sapply(data, is.numeric)]))

Four pairs of predictors with noticably large correlations according the the correlation matrix are further investigated. We learn that and have a correlation coefficient of \(0.947\), and have a correlation coefficient of \(0.888\), and have a correlation coefficient of \(0.864\), and finally and have a correlation coefficient of \(-0.760\). To avoid redundancy in the model, we'll remove one predictor in each of the four pairs from the model. The terms' variance inflation factors will guide the decision regarding which predictor to drop. Using GVIF\(^\frac{1}{2df}\) will allow the GVIFs to be comparable across dimensions. Thus, we remove (8.77 > 6.25), (4.64 > 4.29), (8.42 > 7.42), and (4.29 > 3.55).

full = lm(logsal ~ ., data = data)
vif(full)[,3]
##           Age             G        gs_adj            MP           PER 
##      1.119102      3.000854      1.860765      6.240509      7.400986 
##         X3PAr       ftr_adj       orb_adj          DRB.          AST. 
##      2.495987      1.442517      2.554112      2.256841      3.072991 
##       stl_adj       blk_adj          TOV.          USG.          ORtg 
##      1.858169      2.364577      2.885804      5.059147      6.446025 
##          DRtg       ows_adj       dws_adj         WS.48          OBPM 
##      4.268697      3.069610      4.543134      8.393912      6.126404 
##          DBPM      vorp_adj       Pos_cat        Height       pts_adj 
##      3.542653      3.301813      1.577793      2.394156      8.743606 
##           TS. International    Conference 
##      4.631713      1.113687      1.081437
data <- data_init %>% dplyr::select(logsal, Age, G, gs_adj, MP, PER, X3PAr, ftr_adj, orb_adj, DRB., AST.,
                                    stl_adj, blk_adj, TOV., USG., ORtg, ows_adj, dws_adj,
                                    OBPM, DBPM, vorp_adj, Pos_cat, Height, International,
                                    Conference)

We will start by looking at a full model with all remaining predictors. We observe that in this case, Age, MP, vorp_adj, and Multi-team affiliation have very significant relationships with the log of salary, even after accounting for the other predictors. Other variables that have moderately significant relationships with the log of salary even after accounting for the other predictors are G, orb_adj, USG., and ows_adj.

#all_models = regsubsets(SALARY ~ ., force.in = 1, data = data, nbest = max(choose(n_predictors, 0:n_predictors)), really.big = T, nvmax = 13)
null = lm(logsal ~ 1, data = data)
full = lm(logsal ~ ., data = data)
summary(full)
## 
## Call:
## lm(formula = logsal ~ ., data = data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -3.12733 -0.46812  0.04338  0.55034  1.86549 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       8.7021392  2.9814207   2.919 0.003736 ** 
## Age               0.0658064  0.0106113   6.202 1.54e-09 ***
## G                -0.0181663  0.0055382  -3.280 0.001139 ** 
## gs_adj            0.0194856  0.0475846   0.409 0.682422    
## MP                0.0007226  0.0001837   3.933 0.000101 ***
## PER              -0.0871335  0.0465252  -1.873 0.061907 .  
## X3PAr             0.2832213  0.5158578   0.549 0.583327    
## ftr_adj          -0.4945672  0.3990974  -1.239 0.216078    
## orb_adj           0.3410770  0.1307090   2.609 0.009450 ** 
## DRB.             -0.0136510  0.0159678  -0.855 0.393174    
## AST.              0.0189639  0.0094084   2.016 0.044586 *  
## stl_adj           0.1826712  0.2372052   0.770 0.441750    
## blk_adj          -0.1199107  0.1824635  -0.657 0.511490    
## TOV.              0.0036909  0.0141669   0.261 0.794603    
## USG.              0.0728354  0.0251642   2.894 0.004032 ** 
## ORtg              0.0150725  0.0144326   1.044 0.297032    
## ows_adj           1.2644686  0.3969230   3.186 0.001571 ** 
## dws_adj           0.6322679  0.2681370   2.358 0.018911 *  
## OBPM              0.1038468  0.0754130   1.377 0.169360    
## DBPM              0.1310865  0.0565073   2.320 0.020913 *  
## vorp_adj         -0.9380214  0.1972369  -4.756 2.87e-06 ***
## Pos_catG         -0.4680186  0.2487174  -1.882 0.060684 .  
## Pos_catWing      -0.2256998  0.1571007  -1.437 0.151689    
## Height            0.0194139  0.0290128   0.669 0.503831    
## InternationalYes -0.0264050  0.1100976  -0.240 0.810597    
## ConferenceMulti  -0.6825231  0.1578069  -4.325 1.98e-05 ***
## ConferenceWest   -0.0928064  0.0952214  -0.975 0.330398    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.8297 on 358 degrees of freedom
## Multiple R-squared:  0.557,  Adjusted R-squared:  0.5249 
## F-statistic: 17.32 on 26 and 358 DF,  p-value: < 2.2e-16

We'll next run a stepwise model search using both AIC and BIC to provide another starting point:

bic = log(nrow(data))
stepbic_model <- stepAIC(object = null, scope = list(lower = null, upper = full),
direction = "both", k = bic)
stepaic_model <- stepAIC(object = null, scope = list(lower = null, upper = full),
direction = "both", k = 2)
print(stepaic_model$call)
## lm(formula = logsal ~ MP + Age + Conference + USG. + DBPM + orb_adj + 
##     Pos_cat + AST. + vorp_adj + OBPM + G + PER + ows_adj + dws_adj, 
##     data = data)
print(stepbic_model$call)
## lm(formula = logsal ~ MP + Age + Conference + USG. + DBPM, data = data)

From this initial analysis, some important predictors appear to be MP, Age, Conference, USG., and DBPM. We can see that the 'conference' factor is really picking up on players that appeared on multiple teams. Sometimes these are good players who have been traded, but often these players are end of the bench guys that sign cheap, short term contracts. This having a relationship with salary would make sense. We created in our preprocessing script a new variable that just flags players that appeared on multiple teams within the season.

We'll look at some diagnostic plots for these models to see where we're at:

stepaic_resid_plot <- data.frame(resid=stepaic_model$residuals, fitted_logsal=stepaic_model$fitted.values) %>%  ggplot(aes(x=fitted_logsal, y=resid)) + geom_point() + ggtitle("Stepwise AIC model residual plot")
stepbic_resid_plot <- data.frame(resid=stepbic_model$residuals , fitted_logsal=stepbic_model$fitted.values) %>%  ggplot(aes(x=fitted_logsal, y=resid)) + geom_point() + ggtitle("Stepwise BIC model residual plot")

grid.arrange(stepaic_resid_plot, stepbic_resid_plot, nrow = 1)

We can see that there are some problems with these residual plots. The densest areas of each plot appear to be following a downward trend, and the variance is not constant across all fitted values, i.e., these are not null plots. The models need improvement. It is also noteworthy that despite AIC including several more predictors, the residual plots are very similar.

Added Variable Plots

Stepwise Regression Models

null = lm(logsal ~ 1, data = data)
full = lm(logsal ~ ., data = data)
summary(full)

bic = log(nrow(data))
stepbic_model <- stepAIC(object = null, scope = list(lower = null, upper = full),
direction = "both", k = bic)
stepaic_model <- stepAIC(object = null, scope = list(lower = null, upper = full),
direction = "both", k = 2)

We will build some basic models on the data after these adjustments, to get an idea of what predictors are still important:

data_trimmed <- data_init %>% dplyr::select(logsal, Age, G, gs_adj, MP, PER, X3PAr, ftr_adj, orb_adj, DRB., AST., stl_adj, blk_adj, TOV., USG., ORtg, DRtg, ows_adj, dws_adj, WS.48, OBPM, DBPM, vorp_adj, Pos_cat, Height, pts_adj, TS., International, Multiteam)
trimmed_null = lm(logsal ~ 1, data = data_trimmed)
trimmed_full = lm(logsal ~ ., data = data_trimmed)
corrplot(cor(data_trimmed[sapply(data_trimmed, is.numeric)]), method="number")

trimmed_aic <- stepAIC(object = trimmed_null, scope = list(lower = trimmed_null, upper = trimmed_full),
direction = "both", k = 2)
summary(trimmed_full)
## 
## Call:
## lm(formula = logsal ~ ., data = data_trimmed)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.97747 -0.44726  0.02749  0.55182  2.12532 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      10.7887241  5.7929454   1.862 0.063374 .  
## Age               0.0619535  0.0105083   5.896 8.69e-09 ***
## G                -0.0211991  0.0055688  -3.807 0.000166 ***
## gs_adj           -0.0041681  0.0482924  -0.086 0.931269    
## MP               -0.0001719  0.0003178  -0.541 0.588925    
## PER               0.0339671  0.0610374   0.556 0.578222    
## X3PAr             0.3452160  0.5324448   0.648 0.517171    
## ftr_adj          -0.3683558  0.4140906  -0.890 0.374308    
## orb_adj           0.1968777  0.1632128   1.206 0.228519    
## DRB.             -0.0263340  0.0160502  -1.641 0.101739    
## AST.              0.0049831  0.0138668   0.359 0.719542    
## stl_adj          -0.0978102  0.2632827  -0.372 0.710485    
## blk_adj          -0.3489851  0.1926931  -1.811 0.070972 .  
## TOV.              0.0267741  0.0249610   1.073 0.284162    
## USG.             -0.0129629  0.0417970  -0.310 0.756637    
## ORtg              0.0500782  0.0235417   2.127 0.034092 *  
## DRtg             -0.0297096  0.0450437  -0.660 0.509956    
## ows_adj           0.7769953  0.4639111   1.675 0.094838 .  
## dws_adj           0.7746716  0.3594889   2.155 0.031840 *  
## WS.48            -9.0274781  4.7809750  -1.888 0.059813 .  
## OBPM              0.0728867  0.0936530   0.778 0.436932    
## DBPM              0.1185353  0.0726130   1.632 0.103477    
## vorp_adj         -0.6613434  0.2286878  -2.892 0.004065 ** 
## Pos_catG         -0.4969772  0.2457386  -2.022 0.043886 *  
## Pos_catWing      -0.2122213  0.1554118  -1.366 0.172948    
## Height            0.0224363  0.0288497   0.778 0.437266    
## pts_adj           0.0998109  0.0362844   2.751 0.006250 ** 
## TS.              -4.2719066  3.1091010  -1.374 0.170308    
## InternationalYes -0.0032859  0.1099257  -0.030 0.976170    
## Multiteam        -0.5938917  0.1483221  -4.004 7.58e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.8198 on 355 degrees of freedom
## Multiple R-squared:  0.5712, Adjusted R-squared:  0.5361 
## F-statistic:  16.3 on 29 and 355 DF,  p-value: < 2.2e-16
print(trimmed_aic$call)
## lm(formula = logsal ~ pts_adj + Age + Multiteam + DBPM + G + 
##     vorp_adj + orb_adj + blk_adj + Pos_cat + AST. + DRB. + dws_adj + 
##     X3PAr + ows_adj, data = data_trimmed)

Diagnostics

From this point, we can see that adjusted VORP, adjusted points, Age, G, and Multiteam are the most important predictors, with many others appearing to be useful. We'll look at residuals of a basic model and see where we're at:

trimmed_resid_plot <- data.frame(resid=trimmed_aic$residuals , fitted_logsal=trimmed_aic$fitted.values) %>%  ggplot(aes(x=fitted_logsal, y=resid)) + geom_point() + ggtitle("Stepwise AIC model residual plot")
trimmed_resid_plot

The residual cloud is similar to those before we cut unnecessary predictors.

Leverage Points and Cook's Distances

Now, we'll look at some of the highest leverage cases.

show_diagnostics <- function(model){
  x = model.matrix(model)
  h = x %*% solve(t(x) %*% x) %*% t(x)
  leverages = diag(h)
  lev_resid_plot <- data.frame(resid=model$residuals , fitted_logsal=model$fitted.values, leverage=leverages) %>%  ggplot(aes(x=fitted_logsal, y=resid)) + geom_point(aes(size=leverages, color=leverages)) + ggtitle("Residual plot with leverages")
  cooks = cooks.distance(model)
  cook_resid_plot <- data.frame(resid=model$residuals , fitted_logsal=model$fitted.values, cook_distance=cooks) %>%  ggplot(aes(x=fitted_logsal, y=resid)) + geom_point(aes(size=cook_distance, color=cook_distance)) + ggtitle("Residual plot with cooks distance")
  print(lev_resid_plot)
  print(cook_resid_plot)
}

show_diagnostics(trimmed_aic)

Outliers

There's an outlying case with a large leverage and cooks distance; we'll use a t-test for a mean shift to formally test if this is an outlier:

test_outlier_meanshift <- function(model, ind){
  p = length(coef(model))
  n = nrow(model.matrix(model))
  ri = rstandard(model)[ind]
  t_stat = ri * sqrt(((n - p - 1) / (n - p - (ri ^ 2))))
  pval = pt(t_stat, n - p - 1)
  return (pval)
}

test_outlier_meanshift(trimmed_aic, which.max(cooks.distance(trimmed_aic)))
##        348 
## 0.02061654

We notice that the p-value for a mean shift for this case is small. Looking at the data for this particular observation, we see that it is an outlier on many levels - this player had 2 games played, 6 minutes, a wildly high PER and Usage rate, 0 for many counting stats (Assists, steals, etc.), and a very small salary of $30000. Many of these stats are unstable due to the very small number of minutes played. Also, this player is Thanasis Antetokounmpo, who is a unique case because many people consider him to be a non-NBA level player, who only got into the league because his brother Giannis is one of the best players in the world. Thus, we feel he is not a useful data point, and are comfortable removing him from the dataset. We refit a basic model without him.

data <- data_init %>% filter(Player != "Thanasis Antetokounmpo")

data_trimmed <- data %>% mutate(logsal = log(SALARY)) %>% dplyr::select(logsal, Age, G, gs_adj, MP, PER, X3PAr, ftr_adj, orb_adj, DRB., AST., stl_adj, blk_adj, TOV., USG., ORtg, DRtg, ows_adj, dws_adj, WS.48, OBPM, DBPM, vorp_adj, Pos_cat, Height, pts_adj, TS., International, Multiteam)
trimmed_null = lm(logsal ~ 1, data = data_trimmed)
trimmed_full = lm(logsal ~ ., data = data_trimmed)
vif(trimmed_full)
summary(trimmed_full)
trimmed_aic <- stepAIC(object = trimmed_null, scope = list(lower = trimmed_null, upper = trimmed_full),
direction = "both", k = 2)
show_diagnostics(trimmed_aic)

Variance Inflation Factors

We can also use VIFs to evaluate redundancy in data:

full = lm(logsal ~ ., data = data_trimmed)
vif(full)[,3]
##           Age             G        gs_adj            MP           PER 
##      1.112267      2.987726      1.853649      6.481753      7.361902 
##         X3PAr       ftr_adj       orb_adj          DRB.          AST. 
##      2.566367      1.437117      2.780883      2.236671      3.315685 
##       stl_adj       blk_adj          TOV.          USG.          ORtg 
##      1.817906      2.356146      2.908801      5.029606      6.356921 
##          DRtg       ows_adj       dws_adj         WS.48          OBPM 
##      4.267611      3.089980      4.606512      8.263129      6.250622 
##          DBPM      vorp_adj       Pos_cat        Height       pts_adj 
##      3.421618      3.300426      1.580763      2.403358      9.664145 
##           TS. International     Multiteam 
##      4.667096      1.116171      1.086591
#full2 = lm(logsal ~ . - WS.48 - PER - ORtg - PTS, data = data) # Sequentially removed highest VIF until next iteration lowered both Adj R2 and Mult R2

Model Fitting

Test for Interaction

Test for Higher Order

Model Validation

Final Model

Discussion

Conclusion